P2P hardening: fix bucket refresh, pool contention, server RW timeouts, ignore filter & NodeList slice bug #145

Merged
mateeullahmalik merged 2 commits into master from p2pImprovements
Aug 31, 2025
Conversation

@mateeullahmalik (Collaborator) commented Aug 30, 2025

P2P hardening: fix bucket refresh, pool contention, server RW timeouts, ignore filter & NodeList slice bug

Summary

In testnet, StoreArtefacts occasionally hung and supernodes gradually “lost” the network (failing to discover/reconnect). This PR addresses several correctness and resiliency issues in the P2P/Kademlia layer that could cause handler leaks, global stalls during handshakes, incorrect LRU maintenance, futile re-dials to ignored/self nodes, and malformed peer lists.

Changes

  • Server I/O timeouts: add per-message SetReadDeadline/SetWriteDeadline (~30s) in handleConn
    File: supernode/p2p/kademlia/network.go
  • Pool lock narrowing: stop holding connPoolMtx across NewSecureClientConn (TLS/TCP handshake) using double-check pattern
    File: supernode/p2p/kademlia/network.go
  • Correct bucket refresh: call refreshNode with node.ID (raw), and no-op if not found to preserve LRU
    Files: supernode/p2p/kademlia/dht.go, supernode/p2p/kademlia/hashtable.go
  • Ignore filter correctness: key ignoredMap by string(ID) and look up with the same representation; also filter includeNode
    File: supernode/p2p/kademlia/hashtable.go
  • NodeList integrity: fix NodeIDs()/NodeIPs() to allocate with capacity and append (no double-length / empty entries)
    File: supernode/p2p/kademlia/node.go
  • RPC timeouts (clarified): ensure explicit exec timeouts for slow paths (store/replicate/batch find)
    File: supernode/p2p/kademlia/network.go
  • Minor defensive guards: nil-safety around response handling; comments and small hygiene where relevant
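
The NodeList item above describes a common Go slice pitfall. The sketch below is not the code from supernode/p2p/kademlia/node.go; the `Node` type and both function names are hypothetical stand-ins, assuming the original bug was allocating with a length and then appending.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the Kademlia node type.
type Node struct {
	ID []byte
	IP string
}

// nodeIDsBuggy allocates with a length and then appends, so the result
// starts with len(nodes) nil entries followed by the real IDs.
func nodeIDsBuggy(nodes []*Node) [][]byte {
	out := make([][]byte, len(nodes)) // already holds len(nodes) nil entries
	for _, n := range nodes {
		out = append(out, n.ID) // lands after the nil entries
	}
	return out
}

// nodeIDsFixed allocates with zero length and a capacity hint, then appends.
func nodeIDsFixed(nodes []*Node) [][]byte {
	out := make([][]byte, 0, len(nodes))
	for _, n := range nodes {
		out = append(out, n.ID)
	}
	return out
}

func main() {
	nodes := []*Node{
		{ID: []byte("a"), IP: "10.0.0.1"},
		{ID: []byte("b"), IP: "10.0.0.2"},
	}
	fmt.Println(len(nodeIDsBuggy(nodes)), len(nodeIDsFixed(nodes))) // prints "4 2"
}
```

With two input nodes the buggy variant returns four entries, the first two empty, which matches the "double-length / empty entries" symptom named in the change list.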

Why this helps StoreArtefacts

  • Server RW deadlines prevent half-open/slow peers from pinning handler goroutines indefinitely (no FD/goroutine leak over time).
  • Handshake outside pool lock prevents a single slow dial from globally blocking unrelated RPCs (parallel StoreArtefacts/finds keep flowing).
  • Correct LRU refresh + working ignore filter improve target selection and avoid re-contacting self/banned nodes.
  • Clean NodeList output removes subtle routing/replication skew from empty IDs.
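
As a rough illustration of the first point, per-message deadlines in a handler loop might look like the sketch below. It is not the actual handleConn: the fixed 4-byte echo protocol and the names `serveConn` and `idlePeerTimesOut` are assumptions made for the example, and `net.Pipe` stands in for a real TCP connection.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// serveConn echoes fixed-size messages and refreshes the read and write
// deadlines before every message, so a silent peer cannot pin the goroutine.
func serveConn(conn net.Conn, timeout time.Duration) error {
	defer conn.Close()
	buf := make([]byte, 4) // hypothetical fixed message size
	for {
		if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
			return err
		}
		if _, err := io.ReadFull(conn, buf); err != nil {
			return err // includes the timeout error for an idle peer
		}
		if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
			return err
		}
		if _, err := conn.Write(buf); err != nil {
			return err
		}
	}
}

// idlePeerTimesOut connects a peer that never sends anything and reports
// whether the handler exited with a timeout error.
func idlePeerTimesOut(timeoutMillis int) bool {
	server, client := net.Pipe()
	defer client.Close()

	done := make(chan error, 1)
	go func() {
		done <- serveConn(server, time.Duration(timeoutMillis)*time.Millisecond)
	}()

	err := <-done
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	fmt.Println("handler released after idle timeout:", idlePeerTimesOut(50))
}
```

Resetting the deadline inside the loop, rather than once per connection, bounds each message separately, so a long-lived but active connection is not cut off.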

Risk / Mitigations

  • Timeouts are conservative and apply per message; they can be tuned if we see premature closes in high-latency environments.
  • Pool double-check closes the losing connection to avoid leaks.
  • No changes to protobuf or public SDK surfaces.
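
The pool double-check mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in network.go: `Pool`, `fakeConn`, and the `dial` callback are hypothetical, with `dial` standing in for NewSecureClientConn.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// fakeConn is a hypothetical connection that records whether it was closed.
type fakeConn struct{ closed bool }

func (c *fakeConn) Close() error { c.closed = true; return nil }

// Pool is a hypothetical connection pool keyed by address.
type Pool struct {
	mu    sync.Mutex
	conns map[string]io.Closer
}

func NewPool() *Pool { return &Pool{conns: make(map[string]io.Closer)} }

// Get returns the pooled connection for addr, dialing without the lock held.
func (p *Pool) Get(addr string, dial func(string) (io.Closer, error)) (io.Closer, error) {
	p.mu.Lock()
	if c, ok := p.conns[addr]; ok {
		p.mu.Unlock()
		return c, nil
	}
	p.mu.Unlock()

	fresh, err := dial(addr) // slow handshake; other callers are not blocked
	if err != nil {
		return nil, err
	}

	p.mu.Lock()
	if c, ok := p.conns[addr]; ok {
		// Second check: another caller won the race, so drop our connection.
		p.mu.Unlock()
		_ = fresh.Close()
		return c, nil
	}
	p.conns[addr] = fresh
	p.mu.Unlock()
	return fresh, nil
}

// raceDemo simulates a competing caller filling the pool while a slow dial
// is still in flight, then reports what the slow caller ends up with.
func raceDemo() (gotWinner bool, loserClosed bool) {
	p := NewPool()
	winner, loser := &fakeConn{}, &fakeConn{}
	slowDial := func(addr string) (io.Closer, error) {
		_, _ = p.Get(addr, func(string) (io.Closer, error) { return winner, nil })
		return loser, nil
	}
	got, _ := p.Get("10.0.0.1:4445", slowDial)
	return got == io.Closer(winner), loser.closed
}

func main() {
	fmt.Println(raceDemo()) // prints "true true"
}
```

The nested call inside `slowDial` only works because the lock is released during the dial; holding it across the handshake would deadlock here and, in production, would stall every other caller.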

Test Plan

  • Unit/integration (race):

@mateeullahmalik merged commit 98eb76d into master Aug 31, 2025
7 checks passed
@mateeullahmalik deleted the p2pImprovements branch September 5, 2025 11:53